Efficient Inference, Search and Evaluation for Latent Variable Models of Text with Applications to Information Retrieval and Machine Translation
Latent variable models of text, such as topic models, have been explored in many areas of natural language processing, information retrieval and machine translation to aid tasks such as exploratory data analysis, automated topic clustering and finding similar documents in mono- and multilingual collections. Many additional applications of these models, however, could be enabled by more efficient techniques for processing large datasets.
In this thesis, we introduce novel methods that offer efficient inference, search and evaluation for latent variable models of text. We present efficient, online inference for representing documents in several languages in a common topic space and fast approximations for finding near neighbors in the probability simplex representation of mono- and multilingual document collections. Empirical evaluations show that these methods are as accurate as, and significantly faster than, Gibbs sampling and brute-force all-pairs search, respectively. In addition, we present a new extrinsic evaluation metric that achieves very high correlation with common performance metrics while being more efficient to compute. We showcase the efficacy and efficiency of our new approaches on the problems of modeling and finding similar documents in a retrieval system for scientific papers, detecting document translation pairs, and extracting parallel sentences from large comparable corpora. This last task, in turn, allows us to efficiently train a translation model from comparable corpora that outperforms a model trained on parallel data.
Lastly, we improve the latent variable model representation of large documents in mono- and multilingual collections by introducing online inference for topic models with a hierarchical Dirichlet prior structure over textual regions such as document sections. Modeling variations across textual regions using online inference offers a more effective and efficient document representation that goes beyond the bag-of-words assumption, which usually handicaps the performance of these models on large documents.
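To make the near-neighbor search over topic distributions mentioned in this abstract concrete, here is a minimal sketch of the exact brute-force baseline that the thesis's fast approximations are compared against. The abstract does not name a distance measure; the Hellinger distance is an assumption chosen because it is a common metric on the probability simplex, and the function names and toy data are hypothetical.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length.

    Assumption: the abstract does not specify a metric; Hellinger is used
    here only as a common choice for points on the probability simplex.
    """
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s / 2.0)

def nearest_neighbors(query, docs, k=1):
    """Return indices of the k documents closest to `query`.

    This is the exact brute-force scan (one distance per document),
    not the approximate search the thesis proposes.
    """
    ranked = sorted(range(len(docs)), key=lambda i: hellinger(query, docs[i]))
    return ranked[:k]

# Toy topic distributions over three topics (illustrative values only).
docs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
]
query = [0.6, 0.3, 0.1]
closest = nearest_neighbors(query, docs, k=1)
```

Because the Hellinger distance is the Euclidean distance between square-rooted vectors scaled by a constant, taking the element-wise square root of each distribution would let a standard Euclidean nearest-neighbor index stand in for this scan.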
Evons: A Dataset for Fake and Real News Virality Analysis and Prediction
We present a novel collection of news articles originating from fake and real
news media sources for the analysis and prediction of news virality. Unlike
existing fake news datasets, which contain either claims or news article
headlines and bodies, in this collection each article is accompanied by a
Facebook engagement count, which we consider an indicator of the article's
virality. In addition, we provide the article description and thumbnail image with which
the article was shared on Facebook. These images were automatically annotated
with object tags and color attributes. Using cloud-based vision analysis tools,
thumbnail images were also analyzed for faces, and detected faces were annotated
with facial attributes. We empirically investigate the use of this collection
on an example task of article virality prediction.